Here, we'll examine geodemographic clustering in Los Angeles County.
%load_ext watermark
%watermark -v -a "author: eli knaap" -d -u -p segregation,libpysal,geopandas
Author: author: eli knaap

Last updated: 2021-12-20

Python implementation: CPython
Python version       : 3.7.10
IPython version      : 7.28.0

segregation: 2.1.0
libpysal   : 4.5.1
geopandas  : 0.10.1
import geopandas as gpd
from libpysal import weights
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import geoviews as gv
import hvplot.pandas
import seaborn as sns
import matplotlib.pyplot as plt
gv.extension('matplotlib', 'bokeh')
gv.output(backend='bokeh')
scag = gpd.read_file("data/scag_region.gpkg", layer="tracts")
scag = scag.fillna(0)
scag.plot()
la = scag[scag.geoid.str[:5] == '06037']  # FIPS code 06037 = Los Angeles County
wq = weights.Queen.from_dataframe(la)
# keep only the first connected component (the mainland), dropping disconnected island tracts
la = la.iloc[wq.component_labels == 0].copy()
la.plot()
Geodemographic analysis applies unsupervised learning to demographic and socioeconomic data, then examines the spatial distribution of the resulting clusters.
columns = ['median_household_income', 'median_home_value', 'p_asian_persons', 'p_hispanic_persons', 'p_nonhisp_black_persons', 'p_nonhisp_white_persons']
scaler = StandardScaler()
la_kmeans = KMeans(n_clusters=6).fit(scaler.fit_transform(la[columns]))
la_kmeans.labels_
array([0, 0, 2, ..., 0, 1, 0], dtype=int32)
la['kmeans'] = la_kmeans.labels_
la.hvplot(c='kmeans', cmap='tab10', line_width=0.1, alpha=0.7, geo=True, tiles='CartoLight', xaxis=False, yaxis=False, height=500, colorbar=False)
There are some obvious spatial patterns (which we might expect, given the results of our prior esda and segregation analyses). But what do these clusters mean? What kinds of demographic features do they represent?
la.groupby('kmeans')[columns].mean()
| kmeans | median_household_income | median_home_value | p_asian_persons | p_hispanic_persons | p_nonhisp_black_persons | p_nonhisp_white_persons |
|---|---|---|---|---|---|---|
| 0 | 72753.862414 | 507302.751334 | 12.982233 | 28.473489 | 4.709712 | 50.421712 |
| 1 | 38835.390419 | 281580.955956 | 5.755458 | 54.838507 | 20.976398 | 11.005194 |
| 2 | 67302.643864 | 510752.177318 | 49.649554 | 29.503520 | 3.439874 | 15.103184 |
| 3 | 51725.409740 | 377916.454231 | 3.426007 | 28.030378 | 59.877553 | 5.300393 |
| 4 | 48678.792511 | 355693.567665 | 6.985285 | 80.423688 | 3.120991 | 8.431325 |
| 5 | 121651.186252 | 917421.435783 | 11.487458 | 10.867842 | 2.970364 | 70.883407 |
This table is a lot to interpret at once, so a visualization would be handy. Violin plots are a nice way of examining how each of the input variables is distributed across the resulting clusters.
sns.set_style('whitegrid')
fig, ax = plt.subplots(3, 2, figsize=(16, 8))
ax = ax.flatten()
for i, col in enumerate(columns):
    sns.violinplot(data=la, y=col, x=la.kmeans, ax=ax[i])
    ax[i].set_title(col.replace("_", " ").title())
plt.tight_layout()
We can also use a statistic to tell us how well this model fits the data: scikit-learn's silhouette score.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.
This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
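To make the per-sample values concrete, here is a small sketch on synthetic data (the blob locations, sizes, and seed are arbitrary stand-ins, not the LA tract data). It shows that `silhouette_samples` returns one coefficient per observation and that `silhouette_score` is simply their mean:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
# two well-separated synthetic blobs standing in for scaled tract attributes
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

per_sample = silhouette_samples(X, labels)  # one coefficient per observation
mean_score = silhouette_score(X, labels)    # the mean over all samples

# the summary score is just the average of the per-sample coefficients
assert np.isclose(per_sample.mean(), mean_score)
```

Because the blobs are well separated here, the mean score lands near 1; the per-sample values are useful for spotting individual tracts that sit awkwardly between clusters.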
from sklearn.metrics import silhouette_score
silhouette_score(scaler.fit_transform(la[columns]), la_kmeans.labels_)
0.32177879800195
What about other clustering algorithms or other numbers for k? Might we get a better model fit?
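One simple way to explore the choice of k is to sweep over candidate values and compare silhouette scores. A minimal sketch on synthetic data with three planted clusters (the centers and seed are arbitrary; the same loop applies to `scaler.fit_transform(la[columns])`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# three tight synthetic clusters standing in for the scaled tract attributes
X = np.vstack([rng.normal(c, 0.5, (60, 2)) for c in (0, 4, 8)])

# fit K-Means for each candidate k and record the silhouette score
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # the planted k=3 should win here
```

On real data the curve is rarely this clean, but plotting `scores` against k is a quick diagnostic before reaching for a different algorithm.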
la_affprop = AffinityPropagation(damping=0.8, preference=-1000).fit(scaler.fit_transform(la[columns]))
la_affprop.labels_
array([1, 1, 0, ..., 1, 3, 1])
silhouette_score(scaler.fit_transform(la[columns]), la_affprop.labels_)
0.3604191994148679
la['affprop'] = la_affprop.labels_
This will create a linked holoviews plot so we can zoom in on both maps together (click the "wheel zoom" button on the bokeh plot so you can zoom in)
la.hvplot(c='affprop', cmap='tab10', line_width=0.1, alpha=0.7, geo=True, tiles='CartoLight', xaxis=False, yaxis=False, colorbar=False, title='Affinity Prop') + \
la.hvplot(c='kmeans', cmap='tab10', line_width=0.1, alpha=0.7, geo=True, tiles='CartoLight', xaxis=False, yaxis=False, colorbar=False, title='K-Means')
The silhouette score tells us that the affinity propagation clusterer provided a better solution. Nonetheless, we end up with similar spatial patterns.
The obvious spatial patterns in the neighborhood clusters arise from the underlying spatial autocorrelation in the race and class indicators we used to develop them. Instead of letting this autocorrelation simply "fall out" of the results, we can leverage it to create spatially contiguous clusters.
scikit-learn's agglomerative clustering algorithm accepts a connectivity constraint, which we can supply as the sparse adjacency matrix from a pysal `W` object. Let's compare solutions with and without the constraint.
w = weights.Queen.from_dataframe(la)
la_ward = AgglomerativeClustering(n_clusters=8, linkage='ward').fit(scaler.fit_transform(la[columns]))
la['ward'] = la_ward.labels_
la.groupby('ward')[columns].median()
| ward | median_household_income | median_home_value | p_asian_persons | p_hispanic_persons | p_nonhisp_black_persons | p_nonhisp_white_persons |
|---|---|---|---|---|---|---|
| 0 | 58234.938202 | 408247.8 | 13.006862 | 57.344708 | 3.104464 | 18.508217 |
| 1 | 121843.851124 | 1088952.0 | 8.052817 | 7.435296 | 1.749524 | 76.020506 |
| 2 | 85898.657303 | 610139.4 | 12.389735 | 20.232354 | 2.826710 | 56.736570 |
| 3 | 55863.202247 | 265921.9 | 6.707317 | 23.647554 | 5.884413 | 49.844042 |
| 4 | 42632.443820 | 326685.4 | 2.596032 | 86.811146 | 1.379707 | 3.560149 |
| 5 | 71669.875000 | 556399.7 | 58.589877 | 21.144737 | 1.311037 | 13.963383 |
| 6 | 40265.063670 | 318300.5 | 3.785644 | 59.277575 | 23.795872 | 5.508182 |
| 7 | 48770.862360 | 371876.9 | 1.736527 | 25.000000 | 58.754129 | 2.723888 |
la.plot('ward', categorical=True)
la_ward_spatial = AgglomerativeClustering(n_clusters=8, linkage='ward', connectivity=w.sparse).fit(scaler.fit_transform(la[columns]))
la['ward_spatial'] = la_ward_spatial.labels_
la.hvplot(c='ward', cmap='tab10', line_width=0.1, alpha=0.7, geo=True, tiles='CartoLight', xaxis=False, yaxis=False, frame_height=450, colorbar=False) + \
la.hvplot(c='ward_spatial', cmap='tab10', line_width=0.1, alpha=0.7, geo=True, tiles='CartoLight', xaxis=False, yaxis=False, frame_height=450, colorbar=False)
silhouette_score(scaler.fit_transform(la[columns]), la_ward.labels_)
silhouette_score(scaler.fit_transform(la[columns]), la_ward_spatial.labels_)
Why is the silhouette score higher for the first solution?
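One way to build intuition: the connectivity constraint forces clusters to be contiguous, so whenever attribute similarity and spatial adjacency disagree, the constrained solution necessarily fits the attribute space less well. A toy sketch on a synthetic grid "map" (not the LA data; the grid size, values, and seed are arbitrary):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
side = 20
# a salt-and-pepper "map": each cell draws its attribute near 0 or near 3 at random,
# so attribute similarity and spatial adjacency pull in opposite directions
X = rng.normal(rng.integers(0, 2, (side * side, 1)) * 3.0, 0.5)

connectivity = grid_to_graph(side, side)  # rook-style adjacency for the grid cells

free = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
constrained = AgglomerativeClustering(
    n_clusters=2, linkage='ward', connectivity=connectivity).fit(X)

s_free = silhouette_score(X, free.labels_)
s_con = silhouette_score(X, constrained.labels_)
# the unconstrained solution optimizes attribute similarity alone, so it scores higher
assert s_free > s_con
```

The constrained solution trades feature-space fit for spatial contiguity; whether that trade is worthwhile depends on whether you need map regions or purely attribute-based types.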
damping=0.8 and preference=-100

# %load solutions/05.py